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Abstract 

The choice of approximate posterior distribution 
is one of the core problems in variational infer¬ 
ence. Most applications of variational inference 
employ simple families of posterior approxima¬ 
tions in order to allow for efficient inference, fo¬ 
cusing on mean-field or other simple structured 
approximations. This restriction has a signifi¬ 
cant impact on the quality of inferences made 
using variational methods. We introduce a new 
approach for specifying fiexible, arbitrarily com¬ 
plex and scalable approximate posterior distribu¬ 
tions. Our approximations are distributions con¬ 
structed through a normalizing fiow, whereby a 
simple initial density is transformed into a more 
complex one by applying a sequence of invertible 
transformations until a desired level of complex¬ 
ity is attained. We use this view of normalizing 
fiows to develop categories of finite and infinites¬ 
imal flows and provide a unified view of ap¬ 
proaches for constructing rich posterior approxi¬ 
mations. We demonstrate that the theoretical ad¬ 
vantages of having posteriors that better match 
the true posterior, combined with the scalability 
of amortized variational approaches, provides a 
clear improvement in performance and applica¬ 
bility of variational inference. 

1. Introduction 

There has been a great deal of renewed interest in varia¬ 
tional inference as a means of scaling probabilistic mod¬ 
eling to increasingly complex problems on increasingly 
larger data sets. Variational inference now lies at the core of 
large-scale topic models of text (Hoffman et al., 2013), pro¬ 
vides the state-of-the-art in semi-supervised classification 
(Kingma et al., 2014), drives the models that currently pro¬ 
duce the most realistic generative models of images (Gre¬ 
gor et al., 2014; 2015; Rezende et al., 2014; Kingma & 
Welling, 2014), and are a default tool for the understanding 
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of many physical and chemical systems. Despite these suc¬ 
cesses and ongoing advances, there are a number of disad¬ 
vantages of variational methods that limit their power and 
hamper their wider adoption as a default method for statis¬ 
tical inference. It is one of these limitations, the choice of 
posterior approximation, that we address in this paper. 

Variational inference requires that intractable posterior dis¬ 
tributions be approximated by a class of known probability 
distributions, over which we search for the best approxima¬ 
tion to the true posterior. The class of approximations used 
is often limited, e.g., mean-field approximations, implying 
that no solution is ever able to resemble the true posterior 
distribution. This is a widely raised objection to variational 
methods, in that unlike other inferential methods such as 
MCMC, even in the asymptotic regime we are unable re¬ 
cover the true posterior distribution. 

There is much evidence that richer, more faithful posterior 
approximations do result in better performance. For exam¬ 
ple, when compared to sigmoid belief networks that make 
use of mean-field approximations, deep auto-regressive 
networks use a posterior approximation with an auto¬ 
regressive dependency structure that provides a clear im¬ 
provement in performance (Mnih & Gregor, 2014). There 
is also a large body of evidence that describes the detri¬ 
mental effect of limited posterior approximations. Turner 
& Sahani (2011) provide an exposition of two commonly 
experienced problems. The first is the widely-observed 
problem of under-estimation of the variance of the poste¬ 
rior distribution, which can result in poor predictions and 
unreliable decisions based on the chosen posterior approx¬ 
imation. The second is that the limited capacity of the pos¬ 
terior approximation can also result in biases in the MAP 
estimates of any model parameters (and this is the case e.g., 
in time-series models). 

A number of proposals for rich posterior approximations 
have been explored, typically based on structured mean- 
field approximations that incorporate some basic form of 
dependency within the approximate posterior. Another po¬ 
tentially powerful alternative would be to specify the ap¬ 
proximate posterior as a mixture model, such as those de¬ 
veloped by Jaakkola & Jordan (1998); Jordan et al. (1999); 
Gershman et al. (2012). But the mixture approach limits 
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the potential scalability of variational inference since it re¬ 
quires evaluation of the log-likelihood and its gradients for 
each mixture component per parameter update, which is 
typically computationally expensive. 

This paper presents a new approach for specifying approx¬ 
imate posterior distributions for variational inference. We 
begin by reviewing the current best practice for inference 
in general directed graphical models, based on amortized 
variational inference and efficient Monte Carlo gradient es¬ 
timation, in section 2. We then make the following contri¬ 
butions: 

• We propose the specification of approximate poste¬ 
rior distributions using normalizing fiows, a tool for 
constructing complex distributions by transforming a 
probability density through a series of invertible map¬ 
pings (sect. 3). Inference with normalizing fiows pro¬ 
vides a tighter, modified variational lower bound with 
additional terms that only add terms with linear time 
complexity (sect 4). 

• We show that normalizing fiows admit infinitesimal 
fiows that allow us to specify a class of posterior ap¬ 
proximations that in the asymptotic regime is able to 
recover the true posterior distribution, overcoming one 
oft-quoted limitation of variational inference. 

• We present a unified view of related approaches for 
improved posterior approximation as the application 
of special types of normalizing fiows (sect 5). 

• We show experimentally that the use of general nor¬ 
malizing fiows systematically outperforms other com¬ 
peting approaches for posterior approximation. 

2. Amortized Variational Inference 

To perform inference it is sufficient to reason using the 
marginal likelihood of a probabilistic model, and requires 
the marginalization of any missing or latent variables in 
the model. This integration is typically intractable, and 
instead, we optimize a lower bound on the marginal like¬ 
lihood. Consider a general probabilistic model with ob¬ 
servations X, latent variables z over which we must inte¬ 
grate, and model parameters 6. We introduce an approxi¬ 
mate posterior distribution for the latent variables g' 0 (z|x) 
and follow the variational principle (Jordan et al., 1999) to 
obtain a bound on the marginal likelihood: 

logp 0 (x) =log j P0{x\z)p{z)dz (1) 

= log f ^'^|^|^| pe(x|z)p(z)dz (2) 

>-IDKL[g0(z|x)||p(z)]+Eg[logp0(x|z)]=-J(x), ( 3 ) 

where we used Jensen’s inequality to obtain the final equa¬ 
tion, pq{x.\z) is a likelihood function and p{z) is a prior 
over the latent variables. We can easily extend this for¬ 
mulation to posterior inference over the parameters 6, but 


we will focus on inference over the latent variables only. 
This bound is often referred to as the negative free energy 

or as the evidence lower bound (ELBO). It consists of 
two terms: the first is the KL divergence between the ap¬ 
proximate posterior and the prior distribution (which acts 
as a regularizer), and the second is a reconstruction error. 
This bound (3) provides a unified objective function for op¬ 
timization of both the parameters 6 and 0 of the model and 
variational approximation, respectively. 

Current best practice in variational inference performs 
this optimization using mini-batches and stochastic gra¬ 
dient descent, which is what allows variational infer¬ 
ence to be scaled to problems with very large data 
sets. There are two problems that must be addressed 
to successfully use the variational approach: 1 ) effi¬ 
cient computation of the derivatives of the expected log- 
likelihood [logp 0 (x|z)], and 2 ) choosing the 

richest, computationally-feasible approximate posterior 
distribution q{'). The second problem is the focus of this 
paper. To address the first problem, we make use of two 
tools: Monte Carlo gradient estimation and inference net¬ 
works, which when used together is what we refer to as 
amortized variational inference. 

2.1. Stochastic Backpropagation 

The bulk of research in variational inference over the years 
has been on ways in which to compute the gradient of the 
expected log-likelihood [logp(x|z)]. Whereas 

we would have previously resorted to local variational 
methods (Bishop, 2006), in general we now always com¬ 
pute such expectations using Monte Carlo approximations 
(including the KL term in the bound, if it is not analytically 
known). This forms what has been aptly named doubly- 
stochastic estimation (Titsias & Lazaro-Gredilla, 2014), 
since we have one source of stochasticity from the mini¬ 
batch and a second from the Monte Carlo approximation of 
the expectation. 

We focus on models with continuous latent variables, and 
the approach we take computes the required gradients us¬ 
ing a non-centered reparameterization of the expectation 
(Papaspiliopoulos et al., 2003; Williams, 1992), combined 
with Monte Carlo approximation — referred to as stochas¬ 
tic backpropagation (Rezende et al., 2014). This approach 
has also been referred to or as stochastic gradient vari¬ 
ational Bayes (SGVB) (Kingma & Welling, 2014) or as 
affine variational inference (Challis & Barber, 2012). 

Stochastic backpropagation involves two steps: 

• Reparameterization. We reparameterize the latent 
variable in terms of a known base distribution and 
a differentiable transformation (such as a location- 
scale transformation or cumulative distribution func¬ 
tion). For example, if q(f){z) is a Gaussian distribution 
cr^), with f cr^}, then the location-scale 
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transformation using the standard Normal as a base 
distribution allows us to reparameterize z as: 

z ^ cr^) 2 ; = + ere, e ^ A/'(0,1) 

• Backpropagation with Monte Carlo. We can now 

differentiate (backpropagation) w.r.t. the parameters 
(j) of the variational distribution using a Monte Carlo 
approximation with draws from the base distribution: 

[fe{z)] ^ E^(e|o,i) [V<^/0 (m + 0-e)] • 

A number of general purpose approaches based on Monte 
Carlo control variate (MCCV) estimators exist as an alter¬ 
native to stochastic backpropagation, and allow for gradi¬ 
ent computation with latent variables that may be contin¬ 
uous or discrete (Williams, 1992; Mnih & Gregor, 2014; 
Ranganath et al., 2013; Wingate & Weber, 2013). An im¬ 
portant advantage of stochastic backpropagation is that, for 
models with continuous latent variables, it has the lowest 
variance among competing estimators. 

2.2. Inference Networks 

A second important practice is that the approximate pos¬ 
terior distribution g 0 (') is represented using a recognition 
model or inference network (Rezende et al., 2014; Dayan, 
2000; Gershman & Goodman, 2014; Kingma & Welling, 
2014). An inference network is a model that learns an 
inverse map from observations to latent variables. Us¬ 
ing an inference network, we avoid the need to compute 
per data point variational parameters, but can instead com¬ 
pute a set of global variational parameters 4> valid for in¬ 
ference at both training and test time. This allows us to 
amortize the cost of inference by generalizing between the 
posterior estimates for all latent variables through the pa¬ 
rameters of the inference network. The simplest inference 
models that we can use are diagonal Gaussian densities, 
g^(z|x) = A/'(z|/i^(x),diag(cr|(x))), where the mean 
function ^^^(x) and the standard-deviation function cr 0 (x) 
are specified using deep neural networks. 

2.3. Deep Latent Gaussian Models 

In this paper, we study deep latent Gaussian models 
(DLGM), which are a general class of deep directed graph¬ 
ical models that consist of a hierarchy of L layers of Gaus¬ 
sian latent variables z/ for layer 1. Each layer of latent vari¬ 
ables is dependent on the layer above in a non-linear way, 
and for DLGMs, this non-linear dependency is specified by 
deep neural networks. The joint probability model is: 

L 

P(x,Zi,...,Zl) =p(x|/o(zi))P[p(Zi|/;(z;+i)) (4) 

1 = 1 

where the Lth Gaussian distribution is not dependent on 
any other random variables. The prior over latent vari¬ 
ables is a unit Gaussian p(z^) = A/’( 0 ,1) and the observa¬ 
tion likelihood p 0 (xIz) is any appropriate distribution that 


is conditioned on zi and is also parameterized by a deep 
neural network (figure 2). This model class is very gen¬ 
eral and includes other models such as factor analysis and 
PC A, non-linear factor analysis, and non-linear Gaussian 
belief networks as special cases (Rezende et al., 2014). 

DLGMs use continuous latent variables and is a model 
class perfectly suited to fast amortized variational inference 
using the lower bound (3) and stochastic backpropagation. 
The end-to-end system of DLGM and inference network 
can be viewed as an encoder-decoder architecture, and this 
is the perspective taken by Kingma & Welling (2014) who 
present this combination of model and inference strategy 
as a variational auto-encoder. The inference networks used 
in Kingma & Welling (2014); Rezende et al. (2014) are 
simple diagonal or diagonal-plus-low rank Gaussian distri¬ 
butions. The true posterior distribution will be more com¬ 
plex than this assumption allows for, and defining multi¬ 
modal and constrained posterior approximations in a scal¬ 
able manner remains a significant open problem in varia¬ 
tional inference. 

3. Normalizing Flows 

By examining the bound (3), we can see that the optimal 
variational distribution that allows IDkl[^||p] = 0 is one 
for which g' 0 (z|x) = p 0 (z|x), i.e. q matches the true pos¬ 
terior distribution. This possibility is obviously not realiz¬ 
able given the typically used g(') distributions, such as in¬ 
dependent Gaussians or other mean-field approximations. 
Indeed, one limitation of the variational methodology due 
to the available choices of approximating families, is that 
even in an asymptotic regime we can not obtain the true 
posterior. Thus, an ideal family of variational distributions 
g 0 (z|x) is one that is highly fiexible, preferably fiexible 
enough to contain the true posterior as one solution. One 
path towards this ideal is based on the principle of nor¬ 
malizing flows (Tabak & Turner, 2013; Tabak & Vanden- 
Eijnden, 2010). 

A normalizing flow describes the transformation of a prob¬ 
ability density through a sequence of invertible mappings. 
By repeatedly applying the rule for change of variables, 
the initial density ‘flows’ through the sequence of invert¬ 
ible mappings. At the end of this sequence we obtain a 
valid probability distribution and hence this type of flow is 
referred to as a normalizing fiow. 

3.1. Finite Flows 

The basic rule for transformation of densities considers an 
invertible, smooth mapping / : IR^ —> IR^ with inverse 
f~^ = g, i.e. the composition g o f(^z) = z. If we use 
this mapping to transform a random variable z with distri¬ 
bution q{z), the resulting random variable z' = /(z) has a 
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distribution : 


q{z') = q{z) 


det 


df- 


9z' 


q{z) 


, df 


-1 


(5) 


where the last equality can be seen by applying the chain 
rule (inverse function theorem) and is a property of Jaco- 
bians of invertible functions. We can construct arbitrarily 
complex densities by composing several simple maps and 
successively applying (5). The density ^^(z) obtained by 
successively transforming a random variable zq with distri¬ 
bution go through a chain of K transformations fk is: 


Zif = /if o . . . o /2 o /i(zo) 
K 

Ingif(zif) = Ingo(zo) - V^ln 

fc=i 


det 


dfk 


dzk-i 


( 6 ) 

(7) 


partial differential equation describing how the initial den¬ 
sity go(z) evolves over Time’: ^gt(z) = 7t[gt(z)], where 
T describes the continuous-time dynamics. 

Langevin Flow. One important family of flows is given by 
the Langevin stochastic differential equation (SDE): 

dz{t) = F{z{t),t)dt + G{z{t),t)d^{t), (9) 

where d^{t) is a Wiener process with E[^^(t)] = 0 and 
= SijS{t — t'), F is the drift vector and 
D = GG^ is the diffusion matrix. If we transform a 
random variable z with initial density go(z) through the 
Langevin flow (9), then the rules for the transformation 
of densities is given by the Fokker-Planck equation (or 
Kolmogorov equations in probability theory). The density 
gt(z) of the transformed samples at time t will evolve as: 


where equation (6) will be used throughout the paper as a 
shorthand for the composition /k(/k-i(. .. /i(x))). The 
path traversed by the random variables z/c = //c(z/c_i) with 
initial distribution go(zo) is called the and the path 
formed by the successive distributions qk is a normalizing 
flow. A property of such transformations, often referred 
to as the law of the unconscious statistician (LOTUS), is 
that expectations w.r.t. the transformed density qk can be 
computed without explicitly knowing g^. Any expectation 
Eg^ [h{z)] can be written as an expectation under go as: 

E-Jif [^(z)] = ^qoiKfK ° fx-l o . . . o /l(zo))], (8) 

which does not require computation of the the logdet- 
Jacobian terms when h{z) does not depend on qk- 

We can understand the effect of invertible flows as a se¬ 
quence of expansions or contractions on the initial density. 
For an expansion, the map z' = /(z) pulls the points z 
away from a region in IR^, reducing the density in that re¬ 
gion while increasing the density outside the region. Con¬ 
versely, for a contraction, the map pushes points towards 
the interior of a region, increasing the density in its interior 
while reducing the density outside. 

The formalism of normalizing flows now gives us a sys¬ 
tematic way of specifying the approximate posterior distri¬ 
butions g(z|x) required for variational inference. With an 
appropriate choice of transformations /k, we can initially 
use simple factorized distributions such as an independent 
Gaussian, and apply normalizing flows of different lengths 
to obtain increasingly complex and multi-modal distribu¬ 
tions. 

3.2. Infinitesimal Flows 

It is natural to consider the case in which the length of the 
normalizing flow tends to inflnity. In this case, we obtain 
an inflnitesimal flow, that is described not in terms of a fi¬ 
nite sequence of transformations — a finite flow, but as a 


In machine learning, we most often use the Langevin flow 
with F{zfl) = —VzC(z) and G(z,t) = V^Sij, where 
C{z) is an unnormalised log-density of our model. 

Importantly, in this case the stationary solution for gt(z) 
is given by the Boltzmann distribution: goo(z) cx: 

That is, if we start from an initial density go(z) and 
evolve its samples zq through the Langevin SDE, the re¬ 
sulting points Zoo will be distributed according to goo(z) oc 
g-/:(z)^ i.e. the true posterior. This approach has been ex¬ 
plored for sampling from complex densities by Welling & 
Teh (2011); Ahn et al. (2012); Suykens et al. (1998). 

Hamiltonian Flow. Hamiltonian Monte Carlo can also be 
described in terms of a normalizing flow on an augmented 
space z = (z, cj) with dynamics resulting from the Hamil¬ 
tonian H(z, cj) = —jC{z) — HMC is also widely 

used in machine learning, e.g., Neal (201 1). We will use the 
Hamiltonian flow to make a connection to the recently in¬ 
troduced Hamiltonian variational approach from Salimans 
et al. (2015) in section 5. 

4. Inference with Normalizing Flows 

To allow for scalable inference using finite normalizing 
flows, we must specify a class of invertible transformations 
that can be used and an efficient mechanism for computing 
the determinant of the Jacobian. While it is straightforward 
to build invertible parametric functions for use in equa¬ 
tion (5), e.g., invertible neural networks (Baird et al., 2005; 
Rippel & Adams, 2013), such approaches typically have 
a complexity for computing the Jacobian determinant that 
scales as 0{LD^), where D is the dimension of the hidden 
layers and L is the number of hidden layers used. Further¬ 
more, computing the gradients of the Jacobian determinant 
involves several additional operations that are also 0{LD^) 
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and involve matrix inverses that can be numerically unsta¬ 
ble. We therefore require normalizing flows that allow for 
low-cost computation of the determinant, or where the Ja¬ 
cobian is not needed at all. 


4.1. Invertible Linear-time Transformations 


We consider a family of transformations of the form: 

/(z) = z + u/i(w^z + 6), (10) 


where A = {w G IR^,u G IR ^,6 G IR} are free pa¬ 
rameters and h{’) is a smooth element-wise non-linearity, 
with derivative For this mapping we can compute 

the logdet-Jacobian term in 0{D) time (using the matrix 
determinant lemma): 


'0(z) = h'{w^z + b)w 


( 11 ) 


det ^ 


= I det(I 4- u'0(z)^)| = |1 + u^'0(z)|. (12) 


From (7) we conclude that the density ^^(z) obtained by 
transforming an arbitrary initial density go(z) through the 
sequence of maps fk of the form ( 10 ) is implicitly given 
by: 


ZK = Ik-i o ... o /i(z) 

K 

\iiqK{zK) = \nqo{z) - ^ In |1 + u^'0/c(z/c-i)|. (13) 

k=l 

The flow deflned by the transformation (13) modifies the 
initial density go by applying a series of contractions and 
expansions in the direction perpendicular to the hyperplane 
w^z + 6 = 0 , hence we refer to these maps as planar flows. 

As an alternative, we can consider a family of transforma¬ 
tions that modify an initial density go around a reference 
point zo. The transformation family is: 


/(z) = 
, df 

det — 
oz 


z + (3h{a,r){z-zo), (14) 

= [1 + f5h{a, r)f~^ [1 + Ph{a, r) + I3h'{a, r)r)], 


where r = |z — zo|, h{a^ r) = 1 /(q( -h r), and the param¬ 
eters of the map are A = {zo G a G IR^, /d G IR}. 
This family also allows for linear-time computation of the 
determinant. It applies radial contractions and expansions 
around the reference point and are thus referred to as radial 
flows. We show the effect of expansions and contractions 
on a uniform and Gaussian initial density using the flows 
(10) and (14) in figure 1. This visualization shows that we 
can transform a spherical Gaussian distribution into a bi- 
modal distribution by applying two successive transforma¬ 
tions. 


Not all functions of the form (10) or (14) will be invert¬ 
ible. We discuss the conditions for invertibility and how to 
satisfy them in a numerically stable way in the appendix. 



Figure 1. Effect of normalizing flow on two distributions. 
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Figure 2. Inference and generative models. Left: Inference net¬ 
work maps the observations to the parameters of the flow; Right: 
generative model which receives the posterior samples from the 
inference network during training time. Round containers repre¬ 
sent layers of stochastic variables whereas square containers rep¬ 
resent deterministic layers. 


4.2. Flow-Based Free Energy Bound 

If we parameterize the approximate posterior distribution 
with a flow of length K, g 0 (z|x) := qxizR), the free en¬ 
ergy (3) can be written as an expectation over the initial 
distribution go(z): 


•^(x) = E,^(^|:r)[logg0(z|x) - logp(x,z)] 

= ®9o(^o) [Ingif(Zif) - logp(x,Zif)] 

= ^ 90 (^ 0 ) [Ingo(zo)] - [logp(x,zx)] 

■ K 

-®9o(^o) y]ln|l + uji/;fc(zfc_i)| . (15) 


.k=l 


Normalizing flows and this free energy bound can be used 
with any variational optimization scheme, including gener¬ 
alized variational EM. For amortized variational inference, 
we construct an inference model using a deep neural net¬ 
work to build a mapping from the observations x to the 
parameters of the initial density go = A/’(/i, cr) (/i G IR^ 
and a G IR^) as well as the parameters of the flow A. 


4.3. Algorithm Summary and Complexity 

The resulting algorithm is a simple modiflcation of the 
amortized inference algorithm for DLGMs described by 
(Kingma & Welling, 2014; Rezende et al., 2014), which 
we summarize in algorithm 1. By using an inference net- 
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Algorithm 1 Variational Inf. with Normalizing Flows 
Parameters: 0 variational, 6 generative 
while not converged do 
X 4- {Get mini-batch} 

Zo ~ qo{»\x) 

zk ^ fK° fx-i o ... o /i(zo) 

J'(x) « T{x,zk) 

A6 oc —VeiJ^(x) 

A<p (X 

end while 


work we are able to form a single computational graph 
which allows for easy computation of all the gradients 
of the parameters of the inference network and the gen¬ 
erative model. The estimated gradients are used in con¬ 
junction with preconditioned stochastic gradient-based op¬ 
timization methods such as RMSprop or AdaGrad (Duchi 
et al., 2010), where we use parameter updates of the form: 

^ (0*, (f/) + r*(g*g, gp, with r is a diago- 
nal preconditioning matrix that adaptively scales the gradi¬ 
ents for faster minimization. 

The algorithmic complexity of jointly sampling and com¬ 
puting the log-det-Jacobian terms of the inference model 
scales as 0{LN‘^) 0{KD), where L is the number of 

deterministic layers used to map the data to the parame¬ 
ters of the flow, N is the average hidden layer size, K is 
the flow-length and D is the dimension of the latent vari¬ 
ables. Thus the overall algorithm is at most quadratic mak¬ 
ing the overall approach competitive with other large-scale 
systems used in practice. 

5. Alternative Flow-based Posteriors 

Using the framework of normalizing flows, we can provide 
a unifled view of recent proposals for designing more flexi¬ 
ble posterior approximations. At the outset, we distinguish 
between two types of flow mechanisms that differ in how 
the Jacobian is handled. The work in this paper considers 
general normalizing flows and presents a method for linear¬ 
time computation of the Jacobian. In contrast, volume¬ 
preserving flows design the flow such that its Jacobian- 
determinant is equal to one while still allowing for rich pos¬ 
terior distributions. Both these categories allow for flows 
that may be flnite or inflnitesimal. 

The Non-linear Independent Components Estimation 
(NICE) developed by Dinh et al. (2014) is an instance of 
a flnite volume-preserving flow. The transformations used 
are neural networks /(•) with easy to compute inverses ^(•) 
of the form: 

f{z) = {zA,ZB + hx{zA)), ( 16 ) 

fl(z') = (za> Zb - hx{z'A)). ( 17 ) 

where z = (zy^, z^) is an arbitrary partitioning of the vec¬ 


tor z and hx is a neural network with parameters A. This 
form results in a Jacobian that has a zero upper triangu¬ 
lar part, resulting in a determinant of 1. In order to build 
a transformation capable of mixing all components of the 
initial random variable zq, such flows must alternate be¬ 
tween different partitionings of z/.. The resulting density 
using the forward and inverse transformations is given by : 

lngif(/i^ o /^_i o ... o /i(zo)) = Ingo(zo), (18) 
Inqxiz') =qo{giog 2 0 ...ogxiz')). (19) 

We will compare NICE to the general transformation ap¬ 
proach described in section 2.1. Dinh et al. (2014) assume 
the partitioning is of the form z = [za = = 

Zd+i:D]- To enhance mixing of the components in the flow, 
we introduce two mechanisms for mixing the components 
of z before separating them in the disjoint subgroups za 
and z B • The first mechanism applies a random permutation 
(NICE-perm) and the second applies a random orthogonal 
transformation (NICE-orth)^. 

The Hamiltonian variational approximation (HVI) devel¬ 
oped by Salimans et al. (2015) is an instance of an in¬ 
flnitesimal volume-preserving flow. Eor HVI, we consider 
posterior approximations g'(z, u;|x) that make use of addi¬ 
tional auxiliary variables u). The latent variables z are inde¬ 
pendent of the auxiliary variables uj and using the change 
of variables rule, the resulting distribution is: g(z', uj') = 

I J|g(z)g(u;), where z',u;' = /(z,u;) using a transforma¬ 
tion /. Salimans et al. (2015) obtain a volume-preserving 
invertible transformation by exploiting the use of such tran¬ 
sition operators in the MCMC literature, in particular the 
methods of Langevin and Hybrid Monte Carlo. This is an 
extremely elegant approach, since we now know that as the 
number of iterations of the transition function tends to in¬ 
finity, the distribution q{z') will tend to the true distribu¬ 
tion p(z|x). This is an alternative way to make use of the 
Hamiltonian inflnitesimal flow described in section 3.2. A 
disadvantage of using the Langevin or Hamiltonian flow 
is that they require one or more evaluations of the likeli¬ 
hood and its gradients (depending in the number of leapfrog 
steps) per iteration during both training and test time. 

6. Results 

Throughout this section we evaluate the effect of using nor¬ 
malizing flow-based posterior approximations for inference 
in deep latent Gaussian models (DLGMs). Training was 
performed by following a Monte Carlo estimate of the gra¬ 
dient of an annealed version of the free energy (20), with 
respect the model parameters 9 and the variational param¬ 
eters (j) using stochastic backpropoagation. The Monte 

^ Random orthogonal transformations can be generated by 
sampling a matrix with independent unit-Gaussian entries Aij ~ 
A/^(0,1) and then performing a QR-factorization. The resulting 
Q-matrix will be a random orthogonal matrix (Genz, 1998). 
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Table 1. Test energy functions. 
Potential U{z) 
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with r(;i(z) = sin r(; 2 (z) = 3e 2 [ o.e J ^ 

wsjz) — 3a (^i^) and a(x) — 1/(1 + e~^). 


Carlo estimate is computed using a single sample of the 
latent variables per data-point per parameter update. 

A simple annealed version of the free energy is used since 
this was found to provide better results. The modified 
bound is: 



3 IfSS 



K = 2 K = 8 K = 32 



K = 2 K = 8 K = 32 



(a) (b) Norm. Flow 


(c) NICE 




(d) Comparison of KL-divergences. 


ZK = fK° fK-l o . . . o /l(z) 

= E,o(zq) [Inp^(zif) - logp(x,Zif)] 

= Eqo(Zo) [ln®(zo)] - /?tEqo(Zo) [logp(x,Zx)] 

■ K 

-Eqo(Zo) (20) 

_/c=l 


where pt ^ [0,1] is an inverse temperature that follows a 
schedule Pt = min(l, 0.01 +1/10000), going from 0.01 to 
1 after 10000 iterations. 


The deep neural networks that form the conditional prob¬ 
ability between random variables consist of determinis¬ 
tic layers with 400 hidden units using the Maxout non¬ 
linearity on windows of 4 variables (Goodfellow et al., 
2013) . Briefly, the Maxout non-linearity with window- 
size A takes an input vector x G IR^ and computes: 
Maxout(x)/c = max^^{A/c,A(/c+i)} for /c = 0 ... d/A. 

We use mini-batches of 100 data points and RMSprop 
optimization (with learning rate = 1 x 10“^ and 

momentum =0.9) (Kingma & Welling, 2014; Rezende 
et al., 2014). Results were collected after 500,000 parame¬ 
ter updates. Each experiment was repeated 100 times with 
different random seeds and we report the averaged scores 
and standard errors. The true marginal likelihood is esti¬ 
mated by importance sampling using 200 samples from the 
inference network as in (Rezende et al., 2014, App. E). 


6.1. Representative Power of Normalizing Flows 

To provide an insight into the representative power of den¬ 
sity approximations based on normalizing fiows, we pa¬ 
rameterize a set of unnormalized 2D densities p(z) (x 
exp[—f/(z)] which are listed in table 1. 

In figure 3(a) we show the true distribution for four cases. 


Figure 3. Approximating four non-Gaussian 2D distributions. 
The images represent densities for each energy function in table 
1 in the range (—4,4)^. (a) True posterior; (b) Approx poste¬ 
rior using the normalizing flow (13); (c) Approx posterior using 
NICE (19); (d) Summary results comparing KL-divergences be¬ 
tween the true and approximated densities for the first 3 cases. 

which show distributions that have characteristics such as 
multi-modality and periodicity that cannot be captured with 
typically-used posterior approximations. 

Figure 3(b) shows the performance of normalizing fiow 
approximations for these densities using fiow lengths of 
2, 8 and 32 transformations. The non-linearity h{z) = 
tanh(z) in equation (10) was used for the mapping and 
the initial distribution was a diagonal Gaussian, go(z) = 
A/’(z|/Lt, cr^I). We see a substantial improvement in the ap¬ 
proximation quality as we increase the fiow length. Fig¬ 
ure 3(c) shows the same approximation using the volume¬ 
preserving transformation used in NICE (Dinh et al., 2014) 
for the same number of transformations. We show sum¬ 
mary statistics for the planar fiow (13), and NICE (18) for 
random orthogonal matrices and with random permutation 
matrices in 3(d). We found that NICE and the planar fiow 
(13) may achieve the same asymptotic performance as we 
grow the flow-length, but the planar fiow (13) requires far 
fewer parameters. Presumably because all parameters of 
the fiow (13) are learned, in contrast to NICE which re¬ 
quires an extra mechanism for mixing the components that 
is not learned but randomly initialized. We did not observe 
a substantial difference between using random orthogonal 
matrices or random permutation matrices in NICE. 

6.2. MNIST and CIFAR-10 Images 

The MNIST digit dataset (LeCun & Cortes, 1998) contains 
60,000 training and 10,000 test images of ten handwritten 
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Figure 4. Effect of the flow-length on MNIST. 


Table 2. Comparison of negative log-probabilities on the test set 


for the binarised MNIST data. 

Model 

— Inp(x) 

DLGM diagonal covariance 

< 89.9 

DLGM+NF (k = 10) 

< 87.5 

DLGM+NF (k = 20) 

< 86.5 

DLGM+NF (k = 40) 

< 85.7 

DLGM+NF (k = 80) 

< 85.1 

DLGM+NICE (k = 10) 

< 88.6 

DLGM+NICE (k = 20) 

< 87.9 

DLGM+NICE (k = 40) 

< 87.3 

DLGM+NICE (k = 80) 

< 87.2 

Results below from (Salimans et al., 2015) 


DLGM + HVI (1 leapfrog step) 

88.08 

DLGM + HVI (4 leapfrog steps) 

86.40 

DLGM + HVI (8 leapfrog steps) 

85.51 

Results below from ( Gregor et al, 2014) 


DARN rih = 500 

84.71 

DARN rih — 500, adaNoise 

84.13 


Table 3. Test set performance on the CIFAR-10 data. 


K = 0 

K = 2 


K = 10 

— Inp(x) -293.7 

-308.6 

-317.9 

-320.7 


but with 30 latent variables. Since this data is non-binary, 
we use a logit-normal observation likelihood, p(x|/Lt, a) = 


n 


Ar(logit(XO|/Lt,,aO 


, where logit(x) = log We sum- 


Xi(l-Xi) 

marize the results in table 3 where we are again able to 
show that an increase in the flow length K systematically 
improves the test log-likelihoods, resulting in better poste¬ 
rior approximations. 


7. Conclusion and Discussion 


In this work we developed a simple approach for learn¬ 
ing highly non-Gaussian posterior densities by learning 
transformations of simple densities to more complex ones 
through a normalizing flow. When combined with an amor¬ 
tized approach for variational inference using inference 
networks and efficient Monte Carlo gradient estimation, we 
are able to show clear improvements over simple approxi¬ 
mations on different problems. Using this view of normal¬ 
izing flows, we are able to provide a unified perspective of 
other closely related methods for flexible posterior estima¬ 
tion that points to a wide spectrum of approaches for de¬ 
signing more powerful posterior approximations with dif¬ 
ferent statistical and computational tradeoffs. 


digits (0 to 9) that are 28 x 28 pixels in size. We used the 
binarized dataset as in (Uria et al., 2014). We trained differ¬ 
ent DLGMs with 40 latent variables for 500, 000 parameter 
updates. 

The performance of a DLGM using the (planar) nor¬ 
malizing flow (DLGM-i-NF) approximation is com¬ 
pared to the volume-preserving approaches using NICE 
(DLGM-fNICE) on exactly the same model for different 
flow-lengths K, and we summarize the performance in fig¬ 
ure 4. This graph shows that an increase in the flow-length 
systematically improves the bound T, as shown in figure 
4(a), and reduces the KL-divergence between the approx¬ 
imate posterior q{z\x.) and the true posterior distribution 
p(z|x) (figure 4(b)). It also shows that the approach us¬ 
ing general normalizing flows outperforms that of NICE. 
We also show a wider comparison in table 2. Results are 
included for the Hamiltonian variational approach as well, 
but the model specification is different and thus gives an 
indication of attainable performance for this approach on 
this data set. 

The CIFAR-10 natural images dataset (Krizhevsky & Hin¬ 
ton, 2010) consists of 50,000 training and 10,000 test RGB 
images that are of size 3x32x32 pixels from which we ex¬ 
tract 3x8x8 random patches. The color levels were con¬ 
verted to the range [e, 1 — e] with e = 0.0001. Here we 
used similar DLGMs as used for the MNIST experiment. 


An important conclusion from the discussion in section 3 
is that there exist classes of normalizing flows that allow us 
to create extremely rich posterior approximations for vari¬ 
ational inference. With normalizing flows, we are able to 
show that in the asymptotic regime, the space of solutions 
is rich enough to contain the true posterior distribution. If 
we combine this with the local convergence and consis¬ 
tency results for maximum likelihood parameter estimation 
in certain classes of latent variables models (Wang & Tit- 
terington, 2004), we see that we are now able overcome the 
objections to using variational inference as a competitive 
and default approach for statistical inference. Making such 
statements rigorous is an important line of future research. 

Normalizing flows allow us to control the complexity of the 
posterior at run-time by simply increasing the flow length 
of the sequence. The approach we presented considered 
normalizing flows based on simple transformations of the 
form (10) and (14). These are just two of the many maps 
that can be used, and alternative transforms can be designed 
for posterior approximations that may require other con¬ 
straints, e.g., a restricted support. An important avenue of 
future research lies in describing the classes of transforma¬ 
tions that allow for different characteristics of the posterior 
and that still allow for efficient, linear-time computation. 

Ackowledgements: We thank Charles Blundell, Theo- 
phane Weber and Daan Wierstra for helpful discussions. 
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A. Invertibility conditions 

We describe the constraints required to have invertible 
maps for the planar and radial normalizing flows described 
in section 3. 

A.l. Planar flows 

Functions of the form (10) are not always invertible de¬ 
pending on the non-linearity and parameters chosen. When 
using h{x) = tanh(x), a sufficient condition for /(z) to be 
invertible is that w^u > —1. 

This can be seen by splitting z as a sum of a vector z_\_ per¬ 
pendicular to w and a vector zy, parallel to w. Substituting 
z = zx + Z|| into (10) gives 

/(z) = Zx + Z|| + u/i(w^Z|| + b). (21) 

This equation can be solved for z± given zy and y = /(z), 
having a unique solution 

z_\_ = y — z\\ — u/i(w^zy -f- b). (22) 

The parallel component can be further expanded as zy = 
Of ||.^| 2 , where a G IR. The equation that must be solved 
for a is derived by taking the dot product of (21) with w, 
yielding the scalar equation 

w^/(z) = a-|-w^u/i(a + 6). (23) 

A sufficient condition for (23) to be invertible w.r.t a is that 
its r.h.s a + w^uh{a + b) to be a non-decreasing function. 
This corresponds to the condition l-\-w^uh'{a-\-b) > 0 = 
w^u > Since 0 < h' {a b) < 1, it suffices to 

have w^u > — 1. 

We enforce this constraint by taking an arbitrary vec¬ 
tor u and modifying its component parallel to w, pro¬ 
ducing a new vector u such that w^u > —1. The 
modified vector can be compactly written as u(w, u) = 
u+ [m(w^u) — (w^u)] ||.^| 2 , where the scalar function 
m{x) is given by m{x) = — 1 + log(l + e^). 

A.2. Radial flows 

Functions of the form (14) are not always invertible de¬ 
pending on the values of a and (3. This can be seen by 
splitting the vector z as z = zq + rz, where r = |z — zo|. 
Replacing this into (14) gives 

/(z) = Zo+rz+ /3^^. (24) 

a + r 

This equation can be uniquely solved for z given r and y = 

/(z)> 

y - zq 

r (l + 


To obtain a scalar equation for the norm r, we can subtract 
both sides of (24) and take the norm of both sides. This 
gives 

|y-zo|=r(n -• (26) 

V a + r J 

A sufficient condition for (26) to be invertible is for its r.h.s. 
^ (^1 + lo be a non-decreasing function, which im¬ 
plies [3 > — . Since r > 0, it suffices to impose 

P > —a. This constraint is imposed by reparametrizing /3 
as = —a + m{P), where m{x) = log(l + e^). 


z = 


(25) 










